class: center, middle, inverse, title-slide

# Covid-19, Global Pandemic, and Data Science

### Team Chrissy & Ricky
Chrissy Aman and Ricky Sun

### Bates College

### 2022-04-13

---

## Outline

<style type="text/css">
.remark-slide-content {
  font-size: 30px;
  padding: 1em 4em 1em 4em;
}
</style>

- Introduction
- Literature Review
- Our Data
- Methods & Data Analyses
- Results
- Limitations and Potential Future Studies

---

class: inverse, center, middle
background-image: url("images/cool.png")

# Introduction

---

# Introduction

COVID-19 (coronavirus disease 2019) is a contagious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

As of yesterday, there have been over 497,057,239 infections and 6,179,104 deaths since the beginning of the pandemic.

Although no one can predict a pandemic like this, available data can help us, for example, evaluate risks, so that the virus and its mutations can be better managed or even contained at earlier stages.

---

## Research Question

- In our research, we use Covid-19 data, together with other relevant data, to find potential predictors of Covid-19 cases, deaths, or vaccination rates.

? Do vaccinations effectively mitigate the death rate?

? Does a higher percentage of older people predict a higher death rate?

? What about other variables in predicting Covid-19?

? Can we implement a machine learning algorithm to predict Covid-19?

---

class: inverse, middle, center

# Literature Review

---

# [1a] Covid-19 severity

.pull-left[

<div class="figure" style="text-align: center">
<img src="https://www.mdpi.com/pathogens/pathogens-09-00817/article_deploy/html/images/pathogens-09-00817-g001-550.jpg" alt="Reference: Mulinari T.O., et al. (2020) MDPI" width="60%" />
<p class="caption">Reference: Mulinari T.O., et al.
(2020) MDPI</p>
</div>

]

.pull-right[

- The unpredictable progression of coronavirus disease 2019 (COVID-19) may be attributed to the low precision of the tools used to predict its prognosis, especially as the virus mutates rapidly from Alpha to Omicron and on to more recent variants.

]

---

# [1b] Vitamin D and Covid-19

.pull-left[

<div class="figure" style="text-align: center">
<img src="https://www.healio.com/~/media/slack-news/fm_im/misc/infographics/2020/september/pc0920meltzer_graphic_01.jpg?h=630&w=1200&la=en&hash=AA5CE3037B8695511804123FBF351C92" alt="Reference: Meltzer DO, et al. (2021) JAMA Netw Open" width="100%" />
<p class="caption">Reference: Meltzer DO, et al. (2021) JAMA Netw Open</p>
</div>

]

.pull-right[

- Several studies suggest an association between serum 25-hydroxyvitamin D (25OHD) and the likelihood of suffering severe symptoms of Covid-19.

]

---

# [2] Covid-19 and weather

.pull-left[

- As with other respiratory tract infections, climatic conditions may significantly influence the COVID-19 pandemic. Significant efforts have been made to explore the relationship between climatic conditions and growth in the number of COVID-19 cases.

]

.pull-right[

<div class="figure" style="text-align: center">
<img src="https://ars.els-cdn.com/content/image/1-s2.0-S004896972033179X-ga1.jpg" alt="Reference: Mesay Moges Menebo (2020)" width="80%" />
<p class="caption">Reference: Mesay Moges Menebo (2020)</p>
</div>

]

---

# [3] Covid-19 and social media

.pull-left[

- Social media data from, for example, Google Search, Twitter, Facebook, and other social media platforms may also be used to develop models and serve as early warning signals of COVID-19 outbreaks.
  Social media data can also reveal people's perception of risks and general mental states.

]

.pull-right[

<div class="figure" style="text-align: center">
<img src="https://api.hub.jhu.edu/factory/sites/default/files/styles/soft_crop_2400/public/social_media_032720.jpg?itok=oJE4T6Tf" alt="Reference: JHU Hub (2020)" width="80%" />
<p class="caption">Reference: JHU Hub (2020)</p>
</div>

]

---

# [4] COVID-19 and impacts

.pull-left[

<div class="figure" style="text-align: center">
<img src="https://cdn-japantimes.com/wp-content/uploads/2020/03/np_file_1549.jpeg" alt="Reference: James G. (2020)" width="100%" />
<p class="caption">Reference: James G. (2020)</p>
</div>

]

.pull-right[

- Covid-19 has also had great impacts on our daily lives (racial issues, job markets, economic activities, and so on).

]

---

# [5] Covid-19 and machine learning

.pull-left[

<div class="figure" style="text-align: center">
<img src="https://www.news-medical.net/image.axd?picture=2021%2F12%2Fshutterstock_1722032125.jpg" alt="Reference: Suchandrima Bhowmik (2021)" width="100%" />
<p class="caption">Reference: Suchandrima Bhowmik (2021)</p>
</div>

]

.pull-right[

- Developing accurate forecasting tools will help fight the pandemic. Prediction models that combine several features to estimate the risk of infection aim to assist medical staff worldwide in helping patients, especially in the context of limited healthcare resources.

]

---

class: inverse, middle, center

# Data

---

## Data - details

The dataset comes from the "Our World in Data" Covid-19 public data, together with data from JHU, WHO, CDC, and the World Bank.
The data covers a wide range:

- Basic Covid-19 data (cases, deaths)
- Hospital & ICU (ICU beds, ICU patients)
- Policy responses (stringency_index)
- Health info (reproduction rate, diabetes prevalence)
- Tests & positivity
- Vaccinations
- Others (population, life_expectancy, GDP per capita, and so on)

---

class: inverse, middle, center

# Methods & Data Analyses

---

# Methods & Data Analyses - details

Part I: Preliminary exploration, such as summary statistics, scatter plots, correlations, and maps

Part II: Regression analyses, ranging from OLS and fixed effects to regression discontinuity

Part III: Time series (ARIMA) and machine learning algorithm(s)

---

class: inverse, middle, center

# Results & Implications

---

# Results & Implications [1a] Development (HDI) and Death Rate

.pull-left[

```
## Reading layer `TM_WORLD_BORDERS-0.3' from data source
##   `/cloud/project/data/world_shape_file/TM_WORLD_BORDERS-0.3.shp'
##   using driver `ESRI Shapefile'
## Simple feature collection with 246 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.6236
## Geodetic CRS:  WGS 84
```
]

.pull-right[

```
## Reading layer `TM_WORLD_BORDERS-0.3' from data source
##   `/cloud/project/data/world_shape_file/TM_WORLD_BORDERS-0.3.shp'
##   using driver `ESRI Shapefile'
## Simple feature collection with 246 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.6236
## Geodetic CRS:  WGS 84
```
]

---

# Results & Implications [1b]

.pull-left[

```
## Selecting by human_development_index
```

```
## # A tibble: 10 × 2
##    location    human_development_index
##    <chr>                         <dbl>
##  1 Norway                        0.957
##  2 Ireland                       0.955
##  3 Switzerland                   0.955
##  4 Hong Kong                     0.949
##  5 Iceland                       0.949
##  6 Germany                       0.947
##  7 Sweden                        0.945
##  8 Australia                     0.944
##  9 Netherlands                   0.944
## 10 Denmark                       0.94
```

]

.pull-right[

```
## Selecting by total_cases
```

```
## # A tibble: 10 × 2
##    location       total_cases
##    <chr>                <dbl>
##  1 United States     78477217
##  2 India             42838524
##  3 Brazil            28218180
##  4 France            22339467
##  5 United Kingdom    18664704
##  6 Russia            15147762
##  7 Germany           13667353
##  8 Turkey            13504485
##  9 Italy             12469975
## 10 Spain             10809222
```

]

---

# Results & Implications [1b]

.pull-left[

```
## Selecting by vaccination
```

```
## # A tibble: 10 × 2
##    location             vaccination
##    <chr>                      <dbl>
##  1 United Arab Emirates       0.945
##  2 Portugal                   0.915
##  3 Singapore                  0.898
##  4 Malta                      0.894
##  5 Chile                      0.893
##  6 Cuba                       0.872
##  7 South Korea                0.864
##  8 Spain                      0.831
##  9 Denmark                    0.815
## 10 Cambodia                   0.815
```

]

.pull-right[

```
## Selecting by death_rate
```

```
## # A tibble: 10 × 2
##    location    death_rate
##    <chr>            <dbl>
##  1 Yemen           0.181
##  2 Vanuatu         0.0909
##  3 Sudan           0.0639
##  4 Peru            0.0599
##  5 Mexico          0.0583
##  6 Syria           0.0568
##  7 Somalia         0.0512
##  8 Egypt           0.0506
##  9 Afghanistan     0.0438
## 10 Ecuador         0.0434
```

]

---

# Results & Implications [1c]

<img src="presentation_files/figure-html/correlation_heatmap-1.png" width="80%" />

---

# Results & Implications [1d alt text notes]

<img src="presentation_files/figure-html/vaccination-HDI-1.png" title="Scatterplot of vaccination and human development index (HDI)" alt="Scatterplot of vaccination and human development index (HDI)" width="80%" />

---

# Results & Implications [1d]

<img src="presentation_files/figure-html/vac-death-1.png"
width="80%" />

---

# Results & Implications - [2a]

```
## # A tibble: 5 × 5
##   term                    estimate std.error statistic p.value
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)            -550.      825.        -0.667  0.522
## 2 vaccination            -559.     1558.        -0.359  0.728
## 3 stringency_index          9.07     15.3        0.593  0.567
## 4 handwashing_facilities   10.8      12.7        0.850  0.418
## 5 gdp_per_capita            0.0748    0.0367     2.04   0.0719
```

```
## [1] 0.5955702
```

```
## [1] 0.4158236
```

---

# Results & Implications - [2a]

```
## # A tibble: 5 × 5
##   term                    estimate std.error statistic p.value
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)              441.      551.         0.799  0.469
## 2 vaccination            -3193.     3904.        -0.818  0.459
## 3 stringency_index         -10.9       8.59      -1.27   0.272
## 4 handwashing_facilities    -4.23      9.59      -0.441  0.682
## 5 gdp_per_capita             0.118     0.0379     3.11   0.0359
```

```
## [1] 0.757986
```

```
## [1] 0.5159721
```

---

# Important Dates

Covid-19 Alpha: 31 December 2019 (January 22, 2020)

Covid-19 Delta: 1 December 2020

Covid-19 Omicron: 1 November 2021

-----------

Vaccination: 13 December 2020

Booster: 13 August 2021

---

# Results & Implications - [2a]

```
## # A tibble: 5 × 5
##   term                    estimate std.error statistic p.value
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)            -550.      825.        -0.667  0.522
## 2 vaccination            -559.     1558.        -0.359  0.728
## 3 stringency_index          9.07     15.3        0.593  0.567
## 4 handwashing_facilities   10.8      12.7        0.850  0.418
## 5 gdp_per_capita            0.0748    0.0367     2.04   0.0719
```

```
## [1] 0.5955702
```

```
## [1] 0.4158236
```

---

# Results & Implications - [2a]

```
## # A tibble: 5 × 5
##   term                    estimate std.error statistic p.value
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)              441.      551.         0.799  0.469
## 2 vaccination            -3193.     3904.        -0.818  0.459
## 3 stringency_index         -10.9       8.59      -1.27   0.272
## 4 handwashing_facilities    -4.23      9.59      -0.441  0.682
## 5 gdp_per_capita             0.118     0.0379     3.11   0.0359
```

```
## [1] 0.757986
```

```
## [1] 0.5159721
```

---

# Results & Implications [2b]

<img src="presentation_files/figure-html/ARiMA_1a-1.png" width="90%" height="40%" />

---

<img src="presentation_files/figure-html/ARiMA_1b-1.png" width="90%" height="40%" />

---

<img src="presentation_files/figure-html/ARiMA_2-1.png" width="90%" height="90%" />

```
##
## Autocorrelations of series 'cases_diff', by lag
##
##      0      1      2      3      4      5      6      7      8      9     10
##  1.000 -0.326 -0.245  0.095  0.090 -0.242 -0.149  0.608 -0.081 -0.230  0.004
##     11     12     13     14     15     16     17     18     19     20
##  0.121 -0.178 -0.231  0.612 -0.130 -0.168 -0.003  0.113 -0.211 -0.133
```

---

<img src="presentation_files/figure-html/ARiMA_3-1.png" width="90%" height="90%" />

```
##
## Partial autocorrelations of series 'cases_diff', by lag
##
##      1      2      3      4      5      6      7      8      9     10     11
## -0.326 -0.393 -0.192 -0.056 -0.300 -0.528  0.247  0.347  0.332  0.150  0.129
##     12     13     14     15     16     17     18     19     20
##  0.159 -0.234  0.132 -0.124 -0.082 -0.088 -0.027 -0.123 -0.091
```

---

```
## Series: cases_diff
## ARIMA(2,0,2) with zero mean
##
## Coefficients:
##          ar1      ar2      ma1     ma2
##       0.9022  -0.4360  -1.6963  0.8972
## s.e.  0.0377   0.0421   0.0288  0.0184
##
## sigma^2 = 4.093e+09:  log likelihood = -9525.58
## AIC=19061.16   AICc=19061.24   BIC=19084.35
```

```
##
## Call:
## arima(x = covid_time$new_cases, order = c(2, 0, 2))
##
## Coefficients:
##          ar1     ar2     ma1      ma2  intercept
##       0.3710  0.6069  0.0512  -0.5558  103166.49
## s.e.  0.0803  0.0790  0.0687   0.0449   52848.56
##
## sigma^2 estimated as 5.029e+09:  log likelihood = -9618.34,  aic = 19248.68
```

---

# Results & Implications [2b]

<img src="presentation_files/figure-html/ARiMA_7-1.png" width="80%" />

```
##
##  Ljung-Box test
##
## data:  Residuals from ARIMA(2,0,2) with non-zero mean
## Q* = 258.4, df = 5, p-value < 2.2e-16
##
## Model df: 5.   Total lags used: 10
```

---

<img src="presentation_files/figure-html/ARiMA8-1.png" width="80%" />

---

# Results & Implications - [2c] Fixed effect

```r
#covid_data2 <- covid_data1 %>%
#  select(vaccine_introduced, total_vaccinations, date, first_vaccination, interaction_term,
#location, new_cases_per_million, weekly_hosp_admissions_per_million)
```

```r
#covid_data3 <- covid_data1 %>%
#  select(date, location, new_cases_per_million, weekly_hosp_admissions_per_million, people_fully_vaccinated_per_hundred, new_deaths_per_million) %>% na.omit(people_fully_vaccinated_per_hundred, population_density, total_cases_per_million)
```

```r
#covid_data7 <- covid_data3 %>%
#  filter(location == "United States" | location == "United Kingdom" | location == "Canada" | location == "Belgium" | location == "Israel")
```

```r
#fixed.dum <-lm(weekly_hosp_admissions_per_million ~ people_fully_vaccinated_per_hundred + factor(location) - 1, data = covid_data7)
#summary(fixed.dum)
#yhat <- fixed.dum$fitted
#scatterplot(yhat ~ covid_data7$people_fully_vaccinated_per_hundred | covid_data7$location, xlab ="people fully vaccinated per hundred", ylab ="covid cases", boxplots = FALSE, smooth = FALSE)
#abline(lm(covid_data7$weekly_hosp_admissions_per_million~covid_data7$people_fully_vaccinated_per_hundred),lwd=5, col="red")
```

---

# Results & Implications [2d] Regression Discontinuity

```
##
## Call:
## lm(formula = y ~ ., data = dat_step1, weights = weights)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.394 -3.513 -1.533  1.405 19.208
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    5.171      2.684   1.927   0.0596 .
## D             -1.279      3.969  -0.322   0.7485
## x             16.107     35.252   0.457   0.6497
## x_right      -23.037     36.680  -0.628   0.5328
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.42 on 51 degrees of freedom
##   (160 observations deleted due to missingness)
## Multiple R-squared:  0.01997, Adjusted R-squared:  -0.03768
## F-statistic: 0.3464 on 3 and 51 DF,  p-value: 0.7919
```

---

<img src="presentation_files/figure-html/rdd_plot1-1.png" width="80%" />

---

```
##
## Call:
## lm(formula = y ~ ., data = dat_step1, weights = weights)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.6545 -2.0264 -1.3366  0.9876 15.0801
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.5810     0.8455   3.052  0.00303 **
## D            -0.7295     1.3692  -0.533  0.59556
## x             2.0274     2.6807   0.756  0.45155
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.23 on 85 degrees of freedom
##   (128 observations deleted due to missingness)
## Multiple R-squared:  0.007168, Adjusted R-squared:  -0.01619
## F-statistic: 0.3069 on 2 and 85 DF,  p-value: 0.7366
```

---

<img src="presentation_files/figure-html/rdd_plot2-1.png" width="80%" />

---

# Results & Implications [3]

1. Logistic regression
2. KNN (clusters) & K-Means Clustering
3. Random Forest
4. Support Vector Machine
5. Decision Trees

---

# Results & Implications

What is the kNN algorithm?

Let's assume we have several groups of labeled samples. The items within each group are homogeneous.
Now, suppose we have an unlabeled example that needs to be classified into one of these labeled groups. kNN assigns it to the group that is most common among its k nearest labeled neighbors.

---

# Results & Implications

```
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table:  31
##
##
##                        | knn_test_pred
##                   test | more developed country | less developed country |              Row Total |
## -----------------------|------------------------|------------------------|------------------------|
## more developed country |                     28 |                      0 |                     28 |
##                        |                  1.000 |                  0.000 |                  0.903 |
##                        |                  0.933 |                  0.000 |                        |
##                        |                  0.903 |                  0.000 |                        |
## -----------------------|------------------------|------------------------|------------------------|
## less developed country |                      2 |                      1 |                      3 |
##                        |                  0.667 |                  0.333 |                  0.097 |
##                        |                  0.067 |                  1.000 |                        |
##                        |                  0.065 |                  0.032 |                        |
## -----------------------|------------------------|------------------------|------------------------|
##           Column Total |                     30 |                      1 |                     31 |
##                        |                  0.968 |                  0.032 |                        |
## -----------------------|------------------------|------------------------|------------------------|
##
##
```

---

class: inverse, middle, center

# Limitations and Potential Future Studies

---

# Limitations

1. Data Point (county-level data)
2. Standardization of data
3. Mutations of Covid-19, types of vaccinations
4. Policies
5. Prediction at a global level may not be as useful

---

# Future Studies

1. Tracking recovery for children and the elderly population (sequelae, mental health)
2. Studying the transition in traditional industries (airlines, manufacturing)
3. Anti-trust activities during Covid-19
4. County-level data

---

class: inverse, middle, center

# Thank you!

---

class: inverse, middle, center

# Q & A
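
---

# Appendix: kNN sketch

The kNN idea from the Results section can be sketched in a few lines of base R. This is only a minimal illustration, not the code behind the cross table shown earlier: the helper `knn_predict`, the two features (HDI and vaccination share), and all numbers in `train_x` are hypothetical and made up for the example.

```r
# Toy kNN classifier in base R (illustrative sketch; names and data are hypothetical).
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every labeled training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote among those neighbours
  names(which.max(table(nearest)))
}

# Made-up feature values: column 1 = HDI, column 2 = vaccination share
train_x <- rbind(
  c(0.95, 0.80),
  c(0.94, 0.75),
  c(0.92, 0.85),
  c(0.50, 0.20),
  c(0.45, 0.10),
  c(0.55, 0.30)
)
train_y <- c(rep("more developed country", 3),
             rep("less developed country", 3))

# Classify two unlabeled examples
pred_high <- knn_predict(train_x, train_y, c(0.90, 0.70), k = 3)
pred_low  <- knn_predict(train_x, train_y, c(0.48, 0.15), k = 3)
print(c(pred_high, pred_low))
```

In practice the features would usually be rescaled to comparable ranges before computing distances, and k would be chosen by checking accuracy on held-out data.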